Toward a statistical mechanics of four letter words

Authors

  • Greg J. Stephens
  • William Bialek
Abstract

We consider words as a network of interacting letters, and approximate the probability distribution of states taken on by this network. Despite the intuition that the rules of English spelling are highly combinatorial (and arbitrary), we find that maximum entropy models consistent with pairwise correlations among letters provide a surprisingly good approximation to the full statistics of four letter words, capturing ∼92% of the multi-information among letters and even 'discovering' real words that were not represented in the data from which the pairwise correlations were estimated. The maximum entropy model defines an energy landscape on the space of possible words, and local minima in this landscape account for nearly two-thirds of words used in written English.
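The two quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' code, and it makes several assumptions: multi-information is taken as the sum of single-position letter entropies minus the entropy of the whole-word distribution; the energy follows the convention P(word) ∝ exp(−E) with hypothetical field tables `h` and pairwise coupling tables `J`; and a "local minimum" is a word whose energy cannot be lowered by changing any single letter. Fitting `h` and `J` to measured pairwise correlations is not shown.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of the distribution implied by a table of counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def multi_information(word_counts, length=4):
    """Sum of per-position letter entropies minus the whole-word entropy (bits)."""
    marginals = [Counter() for _ in range(length)]
    for word, c in word_counts.items():
        for i, letter in enumerate(word):
            marginals[i][letter] += c
    return sum(entropy(m) for m in marginals) - entropy(word_counts)


def pairwise_energy(word, h, J):
    """Assumed form: E = -sum_i h[i][x_i] - sum_{i<j} J[(i, j)][(x_i, x_j)].

    h is a list of dicts (one per position); J maps position pairs to dicts
    keyed by letter pairs. Missing entries contribute zero.
    """
    e = -sum(h[i].get(x, 0.0) for i, x in enumerate(word))
    for i in range(len(word)):
        for j in range(i + 1, len(word)):
            e -= J.get((i, j), {}).get((word[i], word[j]), 0.0)
    return e


def is_local_minimum(word, energy, alphabet):
    """True if no single-letter substitution gives a strictly lower energy."""
    e0 = energy(word)
    for i in range(len(word)):
        for a in alphabet:
            if a != word[i] and energy(word[:i] + a + word[i + 1:]) < e0:
                return False
    return True


# Toy usage with made-up couplings that reward equal letters at any two positions.
toy_h = [{} for _ in range(4)]
toy_J = {(i, j): {("a", "a"): 1.0, ("b", "b"): 1.0}
         for i in range(4) for j in range(i + 1, 4)}
toy_energy = lambda w: pairwise_energy(w, toy_h, toy_J)
```

With real data, `word_counts` would hold four-letter word frequencies from a corpus and the alphabet would be the 26 letters. The fraction of multi-information captured by a pairwise model would then be obtained by comparing the value for the fitted model with the value for the empirical distribution.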

Similar resources

Statistical mechanics of letters in words.

We consider words as a network of interacting letters, and approximate the probability distribution of states taken on by this network. Despite the intuition that the rules of English spelling are highly combinatorial and arbitrary, we find that maximum entropy models consistent with pairwise correlations among letters provide a surprisingly good approximation to the full statistics of words, c...

Unfolding Visual Lexical Decision in Time

Visual lexical decision is a classical paradigm in psycholinguistics, and numerous studies have assessed the so-called "lexicality effect" (i.e., better performance with lexical than non-lexical stimuli). Far less is known about the dynamics of choice, because many studies measured overall reaction times, which are not informative about underlying processes. To unfold visual lexical decision in...

Reading sentences of uniform word length: Evidence for the adaptation of the preferred saccade length during reading.

In the current study, the effect of removing word length variability within sentences on spatial aspects of eye movements during reading was investigated. Participants read sentences that were uniform in terms of word length, with each sentence consisting entirely of three-, four-, or five-letter words, or a combination of these word lengths. Several interesting findings emerged. Adaptation of ...

Statistical mechanics of RNA folding: importance of alphabet size.

We construct a base-stacking model of RNA secondary-structure formation and use it to study the mapping from sequence to structure. There are strong, qualitative differences between two-letter and four- or six-letter alphabets. With only two kinds of bases, most sequences have many alternative folding configurations and are consequently thermally unstable. Stable ground states are found only fo...

Distinguishing Functional DNA Words; A Method for Measuring Clustering Levels

Functional DNA sub-sequences and genome elements are spatially clustered through the genome just as keywords in literary texts. Therefore, some of the methods for ranking words in texts can also be used to compare different DNA sub-sequences. In analogy with the literary texts, here we claim that the distribution of distances between the successive sub-sequences (words) is q-exponential which i...

Journal:
  • CoRR

Volume abs/0801.0253

Publication date: 2007